{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classification with Support Vector Machines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### by Soeren Sonnenburg | Saurabh Mahindre - <a href=\\\"https://github.com/Saurabh7\\\">github.com/Saurabh7</a> as a part of <a href=\\\"http://www.google-melange.com/gsoc/project/details/google/gsoc2014/saurabh7/5750085036015616\\\">Google Summer of Code 2014 project</a> mentored by - Heiko Strathmann - <a href=\\\"https://github.com/karlnapf\\\">github.com/karlnapf</a> - <a href=\\\"http://herrstrathmann.de/\\\">herrstrathmann.de</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook illustrates how to train a <a href=\"http://en.wikipedia.org/wiki/Support_vector_machine\">Support Vector Machine</a> (SVM) <a href=\"http://en.wikipedia.org/wiki/Statistical_classification\">classifier</a> using Shogun. The <a href=\"http://www.shogun-toolbox.org/doc/en/3.0.0/classshogun_1_1CLibSVM.html\">CLibSVM</a> class of Shogun is used to do binary classification. Multiclass classification is also demonstrated using [CGMNPSVM](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGMNPSVM.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. [Introduction](#Introduction)\n",
    "2. [Linear Support Vector Machines](Linear-Support-Vector-Machines)\n",
    "  1. [Prediction using Linear SVM](#Prediction-using-Linear-SVM)\n",
    "3. [SVMs using kernels](#SVMs-using-kernels)\n",
    "  1. [Kernels in Shogun](#Kernels-in-Shogun)\n",
    "  2. [Prediction using kernel based SVM](#Prediction-using-kernel-based-SVM)\n",
    "4. [Probabilistic Outputs using SVM](#Probabilistic-Outputs)\n",
    "5. [Soft margins and slack variables](#Soft-margins-and-slack-variables)\n",
    "6. [Binary classification using different kernels](#Binary classification-using-different-kernels)\n",
    "7. [Kernel Normalizers](#Kernel-Normalizers)\n",
    "8. [Multiclass classification using SVM](#Multiclass-classification)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Support Vector Machines (SVM's) are a learning method used for binary classification. The basic idea is to find a hyperplane which separates the data into its two classes. However, since example data is often not linearly separable, SVMs operate in a kernel induced feature space, i.e., data is embedded into a higher dimensional space where it is linearly separable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linear Support Vector Machines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a supervised learning problem, we are given a labeled set of input-output pairs $\\mathcal{D}=(x_i,y_i)^N_{i=1}\\subseteq \\mathcal{X} \\times \\mathcal{Y}$ where $x\\in\\mathcal{X}$ and $y\\in\\{-1,+1\\}$. [SVM](https://en.wikipedia.org/wiki/Support_vector_machine) is a binary classifier that tries to separate objects of different classes by finding a (hyper-)plane such that the margin between the two classes is maximized. A hyperplane in $\\mathcal{R}^D$ can be parameterized by a vector $\\bf{w}$ and a constant $\\text b$ expressed in the equation:$${\\bf w}\\cdot{\\bf x} + \\text{b} = 0$$\n",
    "Given such a hyperplane ($\\bf w$,b) that separates the data, the discriminating function is: $$f(x) = \\text {sign} ({\\bf w}\\cdot{\\bf x} + {\\text b})$$\n",
    "\n",
    "If the training data are linearly separable, we can select two hyperplanes in a way that they separate the data and there are no points between them, and then try to maximize their distance. The region bounded by them is called \"the margin\". These hyperplanes can be described by the equations\n",
    "$$({\\bf w}\\cdot{\\bf x} + {\\text b}) = 1$$\n",
    "$$({\\bf w}\\cdot{\\bf x} + {\\text b}) = -1$$\n",
    "the distance between these two hyperplanes is $\\frac{2}{\\|\\mathbf{w}\\|}$, so we want to minimize $\\|\\mathbf{w}\\|$.\n",
    "$$\n",
    "    \\arg\\min_{(\\mathbf{w},b)}\\frac{1}{2}\\|\\mathbf{w}\\|^2  \\qquad\\qquad(1)$$\n",
    "This gives us a hyperplane that maximizes the geometric distance to the closest data points.\n",
    "As we also have to prevent data points from falling into the margin, we add the following constraint: for each ${i}$ either\n",
    "$$({\\bf w}\\cdot{x}_i + {\\text b}) \\geq 1$$ or\n",
    "$$({\\bf w}\\cdot{x}_i + {\\text b}) \\leq -1$$\n",
    "which is similar to\n",
    "$${y_i}({\\bf w}\\cdot{x}_i + {\\text b}) \\geq 1   \\forall i$$\n",
    "\n",
    "[Lagrange multipliers](http://en.wikipedia.org/wiki/Lagrange_multiplier) are used to modify equation $(1)$ and the corresponding dual of the  problem can be shown to be:\n",
    "\n",
    "  \\begin{eqnarray*}\n",
    "       \\max_{\\bf \\alpha} && \\sum_{i=1}^{N} \\alpha_i - \\sum_{i=1}^{N}\\sum_{j=1}^{N} \\alpha_i y_i \\alpha_j y_j  {\\bf x_i} \\cdot {\\bf x_j}\\\\\n",
    "       \\mbox{s.t.} && \\alpha_i\\geq 0\\\\\n",
    "                   && \\sum_{i}^{N} \\alpha_i y_i=0\\\\\n",
    "\\end{eqnarray*}\n",
    "\n",
    "From the derivation of these equations, it was seen that the optimal hyperplane can be written as:\n",
    "$$\\mathbf{w} = \\sum_i \\alpha_i y_i \\mathbf{x}_i.  $$\n",
    "here most $\\alpha_i$ turn out to be zero, which means that the solution is a sparse linear combination of the training data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prediction using Linear SVM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us see how one can train a linear Support Vector Machine with Shogun. Two dimensional data (having 2 attributes say: attribute1 and attribute2) is now sampled to demonstrate the classification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "import matplotlib.patches as patches\n",
    "#To import all shogun classes\n",
    "import modshogun as sg\n",
    "import numpy as np\n",
    "\n",
    "#Generate some random data\n",
    "X = 2 * np.random.randn(10,2)\n",
    "traindata=np.r_[X + 3, X + 7].T\n",
    "\n",
    "feats_train=sg.RealFeatures(traindata)\n",
    "\n",
    "trainlab=np.concatenate((np.ones(10),-np.ones(10)))\n",
    "labels=sg.BinaryLabels(trainlab)\n",
    "\n",
    "# Plot the training data\n",
    "plt.figure(figsize=(6,6))\n",
    "plt.gray()\n",
    "_=plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)\n",
    "plt.title(\"Training Data\")\n",
    "plt.xlabel('attribute1')\n",
    "plt.ylabel('attribute2')\n",
    "p1 = patches.Rectangle((0, 0), 1, 1, fc=\"k\")\n",
    "p2 = patches.Rectangle((0, 0), 1, 1, fc=\"w\")\n",
    "plt.legend((p1, p2), [\"Class 1\", \"Class 2\"], loc=2)\n",
    "plt.gray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Liblinear](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLibLinear.html), a library for large- scale linear learning focusing on SVM, is used to do the classification. It supports different [solver types](http://www.shogun-toolbox.org/doc/en/latest/namespaceshogun.html#a6e99d1864c93fc2d41b1fa0fc253f471)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#prameters to svm\n",
    "#parameter C is described in a later section.\n",
    "C=1\n",
    "epsilon=1e-3\n",
    "\n",
    "svm=sg.LibLinear(C, feats_train, labels)\n",
    "svm.set_liblinear_solver_type(sg.L2R_L2LOSS_SVC)\n",
    "svm.set_epsilon(epsilon)\n",
    "\n",
    "#train\n",
    "svm.train()\n",
    "w=svm.get_w()\n",
    "b=svm.get_bias()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We solve ${\\bf w}\\cdot{\\bf x} + \\text{b} = 0$ to visualise the separating hyperplane. The methods `get_w()` and `get_bias()` are used to get the necessary values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#solve for w.x+b=0\n",
    "x1=np.linspace(-1.0, 11.0, 100)\n",
    "def solve (x1):\n",
    "    return -( ( (w[0])*x1 + b )/w[1] )\n",
    "\n",
    "x2=map(solve, x1)\n",
    "\n",
    "#plot\n",
    "plt.figure(figsize=(6,6))\n",
    "plt.gray()\n",
    "plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)\n",
    "plt.plot(x1,x2, linewidth=2)\n",
    "plt.title(\"Separating hyperplane\")\n",
    "plt.xlabel('attribute1')\n",
    "plt.ylabel('attribute2')\n",
    "plt.gray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The classifier is now applied on a X-Y grid of points to get predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "size=100\n",
    "x1_=np.linspace(-5, 15, size)\n",
    "x2_=np.linspace(-5, 15, size)\n",
    "x, y=np.meshgrid(x1_, x2_)\n",
    "#Generate X-Y grid test data\n",
    "grid=sg.RealFeatures(np.array((np.ravel(x), np.ravel(y))))\n",
    "\n",
    "#apply on test grid\n",
    "predictions = svm.apply(grid)\n",
    "\n",
    "#Distance from hyperplane\n",
    "z=predictions.get_values().reshape((size, size))\n",
    "\n",
    "#plot\n",
    "plt.jet()\n",
    "plt.figure(figsize=(16,6))\n",
    "plt.subplot(121)\n",
    "plt.title(\"Classification\")\n",
    "c=plt.pcolor(x, y, z)\n",
    "plt.contour(x, y, z, linewidths=1, colors='black', hold=True)\n",
    "plt.colorbar(c)\n",
    "plt.gray()\n",
    "plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)\n",
    "plt.xlabel('attribute1')\n",
    "plt.ylabel('attribute2')\n",
    "plt.jet()\n",
    "\n",
    "#Class predictions\n",
    "z=predictions.get_labels().reshape((size, size))\n",
    "\n",
    "#plot\n",
    "plt.subplot(122)\n",
    "plt.title(\"Separating hyperplane\")\n",
    "c=plt.pcolor(x, y, z)\n",
    "plt.contour(x, y, z, linewidths=1, colors='black', hold=True)\n",
    "plt.colorbar(c)\n",
    "plt.gray()\n",
    "plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)\n",
    "plt.xlabel('attribute1')\n",
    "plt.ylabel('attribute2')\n",
    "plt.gray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SVMs using kernels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the data set is not linearly separable, a non-linear mapping $\\Phi:{\\bf x} \\rightarrow \\Phi({\\bf x}) \\in \\mathcal{F} $ is used. This maps the data into a higher dimensional space where it is linearly separable. Our equation requires only the inner dot products ${\\bf x_i}\\cdot{\\bf x_j}$. The equation can be defined in terms of inner products $\\Phi({\\bf x_i}) \\cdot \\Phi({\\bf x_j})$ instead. Since  $\\Phi({\\bf x_i})$ occurs only in dot products with $ \\Phi({\\bf x_j})$ it is sufficient to know the formula ([kernel function](http://en.wikipedia.org/wiki/Kernel_trick)) : $$K({\\bf x_i, x_j} ) =  \\Phi({\\bf x_i}) \\cdot \\Phi({\\bf x_j})$$ without dealing with the maping directly. The transformed optimisation problem is:\n",
    "\n",
    "\\begin{eqnarray*} \\max_{\\bf \\alpha} && \\sum_{i=1}^{N} \\alpha_i - \\sum_{i=1}^{N}\\sum_{j=1}^{N} \\alpha_i y_i \\alpha_j y_j k({\\bf x_i}, {\\bf x_j})\\\\ \\mbox{s.t.} && \\alpha_i\\geq 0\\\\ && \\sum_{i=1}^{N} \\alpha_i y_i=0 \\qquad\\qquad(2)\\\\ \\end{eqnarray*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Kernels in Shogun"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Shogun provides many options for the above mentioned kernel functions. [CKernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKernel.html) is the base class for kernels. Some commonly used kernels : \n",
    "\n",
    "* [Gaussian kernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianKernel.html) : Popular Gaussian kernel computed as $k({\\bf x},{\\bf x'})= exp(-\\frac{||{\\bf x}-{\\bf x'}||^2}{\\tau})$\n",
    "\n",
    "* [Linear kernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLinearKernel.html) : Computes $k({\\bf x},{\\bf x'})= {\\bf x}\\cdot {\\bf x'}$\n",
    "* [Polynomial kernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CPolyKernel.html) : Polynomial kernel computed as $k({\\bf x},{\\bf x'})= ({\\bf x}\\cdot {\\bf x'}+c)^d$\n",
    "\n",
    "* [Simgmoid Kernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSigmoidKernel.html) : Computes $k({\\bf x},{\\bf x'})=\\mbox{tanh}(\\gamma {\\bf x}\\cdot{\\bf x'}+c)$\n",
    "\n",
    "Some of these kernels are initialised below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gaussian_kernel=sg.GaussianKernel(feats_train, feats_train, 100)\n",
    "#Polynomial kernel of degree 2\n",
    "poly_kernel=sg.PolyKernel(feats_train, feats_train, 2, True)\n",
    "linear_kernel=sg.LinearKernel(feats_train, feats_train)\n",
    "\n",
    "kernels=[linear_kernel, poly_kernel, gaussian_kernel]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just for fun we compute the kernel matrix and display it. There are clusters visible that are smooth for the gaussian and polynomial kernel and block-wise for the linear one. The gaussian one also smoothly decays from some cluster centre while the polynomial one oscillates within the clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.jet()\n",
    "def display_km(kernels, svm):\n",
    "    plt.figure(figsize=(20,6))\n",
    "    plt.suptitle('Kernel matrices for different kernels', fontsize=12)\n",
    "    for i, kernel in enumerate(kernels):\n",
    "        plt.subplot(1, len(kernels), i+1)\n",
    "        plt.title(kernel.get_name())\n",
    "        km=kernel.get_kernel_matrix()\n",
    "        plt.imshow(km, interpolation=\"nearest\")\n",
    "        plt.colorbar()\n",
    "\n",
    "display_km(kernels, svm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prediction using kernel based SVM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we train an SVM with a Gaussian Kernel. We use LibSVM but we could use any of the [other SVM](http://www.shogun-toolbox.org/doc/en/current/classshogun_1_1CSVM.html) from Shogun. They all utilize the same kernel framework and so are drop-in replacements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "C=1\n",
    "epsilon=1e-3\n",
    "svm=sg.LibSVM(C, gaussian_kernel, labels)\n",
    "_=svm.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could now check a number of properties like what the value of the objective function returned by the particular SVM learning algorithm or the explictly computed primal and dual objective function is"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "libsvm_obj=svm.get_objective()\n",
    "primal_obj, dual_obj=svm.compute_svm_primal_objective(), svm.compute_svm_dual_objective()\n",
    "\n",
    "print libsvm_obj, primal_obj, dual_obj"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and based on the objectives we can compute the duality gap (have a look at reference [2]), a measure of convergence quality of the svm training algorithm . In theory it is 0 at the optimum and in reality at least close to 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print \"duality_gap\", dual_obj-primal_obj"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now apply on the X-Y grid data and plot the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "out=svm.apply(sg.RealFeatures(grid))\n",
    "z=out.get_values().reshape((size, size))\n",
    "\n",
    "#plot\n",
    "plt.jet()\n",
    "plt.figure(figsize=(16,6))\n",
    "plt.subplot(121)\n",
    "plt.title(\"Classification\")\n",
    "c=plt.pcolor(x1_, x2_, z)\n",
    "plt.contour(x1_ , x2_, z, linewidths=1, colors='black', hold=True)\n",
    "plt.colorbar(c)\n",
    "\n",
    "plt.gray()\n",
    "plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)\n",
    "plt.xlabel('attribute1')\n",
    "plt.ylabel('attribute2')\n",
    "plt.jet()\n",
    "\n",
    "z=out.get_labels().reshape((size, size))\n",
    "plt.subplot(122)\n",
    "plt.title(\"Decision boundary\")\n",
    "c=plt.pcolor(x1_, x2_, z)\n",
    "plt.contour(x1_ , x2_, z, linewidths=1, colors='black', hold=True)\n",
    "plt.colorbar(c)\n",
    "\n",
    "plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)\n",
    "plt.xlabel('attribute1')\n",
    "plt.ylabel('attribute2')\n",
    "plt.gray()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Probabilistic Outputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calibrated probabilities can be generated in addition to class predictions using `scores_to_probabilities()` method of [BinaryLabels](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CBinaryLabels.html), which implements the method described in [3]. This should only be used in conjunction with SVM. A parameteric form of a [sigmoid function](http://en.wikipedia.org/wiki/Sigmoid_function) $$\\frac{1}{{1+}exp(af(x) + b)}$$ is used to fit the outputs. Here $f(x)$ is the signed distance of a sample from the hyperplane, $a$ and $b$ are parameters to the sigmoid. This gives us the posterier probabilities $p(y=1|f(x))$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try this out on the above example. The familiar \"S\" shape of the sigmoid should be visible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "n=10\n",
    "x1t_=np.linspace(-5, 15, n)\n",
    "x2t_=np.linspace(-5, 15, n)\n",
    "xt, yt=np.meshgrid(x1t_, x2t_)\n",
    "#Generate X-Y grid test data\n",
    "test_grid=sg.RealFeatures(np.array((np.ravel(xt), np.ravel(yt))))\n",
    "\n",
    "labels_out=svm.apply(sg.RealFeatures(test_grid))\n",
    "\n",
    "#Get values (Distance from hyperplane)\n",
    "values=labels_out.get_values()\n",
    "\n",
    "#Get probabilities\n",
    "labels_out.scores_to_probabilities()\n",
    "prob=labels_out.get_values()\n",
    "\n",
    "#plot\n",
    "plt.gray()\n",
    "plt.figure(figsize=(10,6))\n",
    "p1=plt.scatter(values, prob)\n",
    "plt.title('Probabilistic outputs')\n",
    "plt.xlabel('Distance from hyperplane')\n",
    "plt.ylabel('Probability')\n",
    "plt.legend([p1], [\"Test samples\"], loc=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Soft margins and slack variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If there is no clear classification possible using a hyperplane, we need to classify the data as nicely as possible while incorporating the misclassified samples. To do this a concept of soft margin is used. The method introduces non-negative slack variables, $\\xi_i$, which measure the degree of misclassification of the data $x_i$.\n",
    "$$\n",
    "    y_i(\\mathbf{w}\\cdot\\mathbf{x_i} + b) \\ge 1 - \\xi_i \\quad 1 \\le i \\le N  $$\n",
    "\n",
    "Introducing a linear penalty function leads to \n",
    "$$\\arg\\min_{\\mathbf{w},\\mathbf{\\xi}, b } ({\\frac{1}{2} \\|\\mathbf{w}\\|^2 +C \\sum_{i=1}^n \\xi_i) }$$\n",
    "\n",
    "This in its dual form is leads to a slightly modified equation $\\qquad(2)$.\n",
    "\\begin{eqnarray*} \\max_{\\bf \\alpha} && \\sum_{i=1}^{N} \\alpha_i - \\sum_{i=1}^{N}\\sum_{j=1}^{N} \\alpha_i y_i \\alpha_j y_j k({\\bf x_i}, {\\bf x_j})\\\\ \\mbox{s.t.} && 0\\leq\\alpha_i\\leq C\\\\ && \\sum_{i=1}^{N} \\alpha_i y_i=0 \\\\ \\end{eqnarray*}\n",
    "\n",
    "The result is that soft-margin SVM could choose decision boundary that has non-zero training error even if dataset is linearly separable but is less likely to overfit."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's an example using LibSVM on the above used data set. Highlighted points show support vectors. This should visually show the impact of C and how the amount of outliers on the wrong side of hyperplane is controlled using it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def plot_sv(C_values):\n",
    "    plt.figure(figsize=(20,6))\n",
    "    plt.suptitle('Soft and hard margins with varying C', fontsize=12)\n",
    "    for i in range(len(C_values)): \n",
    "        plt.subplot(1, len(C_values), i+1)\n",
    "        linear_kernel=sg.LinearKernel(feats_train, feats_train)\n",
    "        svm1=sg.LibSVM(C_values[i], linear_kernel, labels)\n",
    "        svm1.train()\n",
    "        vec1=svm1.get_support_vectors()\n",
    "        X_=[]\n",
    "        Y_=[]\n",
    "        new_labels=[]\n",
    "        for j in vec1:\n",
    "            X_.append(traindata[0][j])\n",
    "            Y_.append(traindata[1][j])\n",
    "            new_labels.append(trainlab[j])\n",
    "        out1=svm1.apply(sg.RealFeatures(grid))\n",
    "        z1=out1.get_labels().reshape((size, size))\n",
    "        plt.jet()\n",
    "        c=plt.pcolor(x1_, x2_, z1)\n",
    "        plt.contour(x1_ , x2_, z1, linewidths=1, colors='black', hold=True)\n",
    "        plt.colorbar(c)\n",
    "        plt.gray()\n",
    "        plt.scatter(X_, Y_, c=new_labels, s=150)\n",
    "        plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=20)\n",
    "        plt.title('Support vectors for C=%.2f'%C_values[i])\n",
    "        plt.xlabel('attribute1')\n",
    "        plt.ylabel('attribute2')\n",
    "        \n",
    "        \n",
    "C_values=[0.1, 1000]\n",
    "plot_sv(C_values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that lower value of C causes classifier to sacrifice linear separability in order to gain stability, in a sense that influence of any single datapoint is now bounded by C. For hard margin SVM, support vectors are the points which are \"on the margin\". In the picture above, C=1000 is pretty close to hard-margin SVM, and you can see the highlighted points are the ones that will touch the margin. In high dimensions this might lead to overfitting. For soft-margin SVM, with a lower value of C, it's easier to explain them in terms of dual (equation $(2)$) variables. Support vectors are datapoints from training set which are are included in the predictor, ie, the ones with non-zero $\\alpha_i$ parameter. This includes margin errors and points on the margin of the hyperplane."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Binary classification using different kernels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two-dimensional Gaussians are generated as data for this section.\n",
    "\n",
    "$x_-\\sim{\\cal N_2}(0,1)-d$\n",
    "\n",
    "$x_+\\sim{\\cal N_2}(0,1)+d$\n",
    "\n",
    "and corresponding positive and negative labels. We create traindata and testdata with ```num``` of them being negatively and positively labelled in traindata,trainlab and testdata, testlab. For that we utilize Shogun's Gaussian Mixture Model class ([GMM](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGMM.html)) from which we sample the data points and plot them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "num=50;\n",
    "dist=1.0;\n",
    "\n",
    "gmm=sg.GMM(2)\n",
    "gmm.set_nth_mean(np.array([-dist,-dist]),0)\n",
    "gmm.set_nth_mean(np.array([dist,dist]),1)\n",
    "gmm.set_nth_cov(np.array([[1.0,0.0],[0.0,1.0]]),0)\n",
    "gmm.set_nth_cov(np.array([[1.0,0.0],[0.0,1.0]]),1)\n",
    "\n",
    "gmm.set_coef(np.array([1.0,0.0]))\n",
    "xntr=np.array([gmm.sample() for i in xrange(num)]).T\n",
    "\n",
    "gmm.set_coef(np.array([0.0,1.0]))\n",
    "xptr=np.array([gmm.sample() for i in xrange(num)]).T\n",
    "\n",
    "traindata=np.concatenate((xntr,xptr), axis=1)\n",
    "trainlab=np.concatenate((-np.ones(num), np.ones(num)))\n",
    "\n",
    "#shogun format features\n",
    "feats_train=sg.RealFeatures(traindata)\n",
    "labels=sg.BinaryLabels(trainlab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gaussian_kernel=sg.GaussianKernel(feats_train, feats_train, 10)\n",
    "#Polynomial kernel of degree 2\n",
    "poly_kernel=sg.PolyKernel(feats_train, feats_train, 2, True)\n",
    "linear_kernel=sg.LinearKernel(feats_train, feats_train)\n",
    "\n",
    "kernels=[gaussian_kernel, poly_kernel, linear_kernel]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#train machine\n",
    "C=1\n",
    "svm=sg.LibSVM(C, gaussian_kernel, labels)\n",
    "_=svm.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets plot the contour output on a $-5...+5$ grid for \n",
    "\n",
    "1. The Support Vector Machines decision function $\\mbox{sign}(f(x))$\n",
    "2. The Support Vector Machines raw output $f(x)$\n",
    "3. The Original Gaussian Mixture Model Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "size=100\n",
    "x1=np.linspace(-5, 5, size)\n",
    "x2=np.linspace(-5, 5, size)\n",
    "x, y=np.meshgrid(x1, x2)\n",
    "grid=sg.RealFeatures(np.array((np.ravel(x), np.ravel(y))))\n",
    "grid_out=svm.apply(grid)\n",
    "z=grid_out.get_labels().reshape((size, size))\n",
    "\n",
    "plt.jet()\n",
    "plt.figure(figsize=(16,5))\n",
    "\n",
    "z=grid_out.get_values().reshape((size, size))\n",
    "\n",
    "plt.subplot(121)\n",
    "plt.title('Classification')\n",
    "c=plt.pcolor(x, y, z)\n",
    "plt.contour(x, y, z, linewidths=1, colors='black', hold=True)\n",
    "plt.colorbar(c)\n",
    "\n",
    "plt.subplot(122)\n",
    "plt.title('Original distribution')\n",
    "gmm.set_coef(np.array([1.0,0.0]))\n",
    "gmm.set_features(grid)\n",
    "grid_out=gmm.get_likelihood_for_all_examples()\n",
    "zn=grid_out.reshape((size, size))\n",
    "gmm.set_coef(np.array([0.0,1.0]))\n",
    "grid_out=gmm.get_likelihood_for_all_examples()\n",
    "zp=grid_out.reshape((size, size))\n",
    "z=zp-zn\n",
    "c=plt.pcolor(x, y, z)\n",
    "plt.contour(x, y, z, linewidths=1, colors='black', hold=True)\n",
    "plt.colorbar(c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And voila! The SVM decision rule reasonably distinguishes the red from the blue points. Despite being optimized for learning the discriminative function maximizing the margin, the SVM output quality wise remotely resembles the original distribution of the gaussian mixture model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us visualise the output using different kernels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def plot_outputs(kernels):\n",
    "    plt.figure(figsize=(20,5))\n",
    "    plt.suptitle('Binary Classification using different kernels', fontsize=12)\n",
    "    for i in range(len(kernels)):\n",
    "        plt.subplot(1,len(kernels),i+1)\n",
    "        plt.title(kernels[i].get_name())\n",
    "        svm.set_kernel(kernels[i])\n",
    "        svm.train()\n",
    "        grid_out=svm.apply(grid)\n",
    "        z=grid_out.get_values().reshape((size, size))\n",
    "        c=plt.pcolor(x, y, z)\n",
    "        plt.contour(x, y, z, linewidths=1, colors='black', hold=True)\n",
    "        plt.colorbar(c)\n",
    "        plt.scatter(traindata[0,:], traindata[1,:], c=trainlab, s=35)\n",
    "\n",
    "plot_outputs(kernels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Kernel Normalizers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Kernel normalizers post-process kernel values by carrying out normalization in feature space. Since kernel based SVMs use a non-linear mapping, in most cases any normalization in input space is lost in feature space. Kernel normalizers are a possible solution to this. Kernel Normalization is not strictly-speaking a form of preprocessing since it is not applied directly on the input vectors but can be seen as a kernel interpretation of the preprocessing. The [CKernelNormalizer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKernelNormalizer.html) class provides tools for kernel normalization. Some of the kernel normalizers in Shogun:\n",
    "\n",
    "* [SqrtDiagKernelNormalizer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSqrtDiagKernelNormalizer.html) : This normalization in the feature space amounts to defining a new kernel $k'({\\bf x},{\\bf x'}) = \\frac{k({\\bf x},{\\bf x'})}{\\sqrt{k({\\bf x},{\\bf x})k({\\bf x'},{\\bf x'})}}$\n",
    "\n",
    "* [AvgDiagKernelNormalizer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CAvgDiagKernelNormalizer.html) : Scaling with a constant $k({\\bf x},{\\bf x'})= \\frac{1}{c}\\cdot k({\\bf x},{\\bf x'})$\n",
    "\n",
    "* [ZeroMeanCenterKernelNormalizer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CZeroMeanCenterKernelNormalizer.html) : Centers the kernel in feature space and ensures each feature must have zero mean after centering.\n",
    "\n",
    "The `set_normalizer()` method of [CKernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKernel.html) is used to add a normalizer.\n",
    "Let us try it out on the [ionosphere dataset](https://archive.ics.uci.edu/ml/datasets/Ionosphere) where we use a small training set of 30 samples to train our SVM. Gaussian kernel with and without normalization is used. See reference [1] for details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "f = open('../../../data/uci/ionosphere/ionosphere.data')\n",
    "mat = []\n",
    "labels = []\n",
    "# read data from file\n",
    "for line in f:\n",
    "    words = line.rstrip().split(',')\n",
    "    mat.append([float(i) for i in words[0:-1]])\n",
    "    if str(words[-1])=='g':\n",
    "        labels.append(1)\n",
    "    else:\n",
    "        labels.append(-1)    \n",
    "\n",
    "f.close()\n",
    "\n",
    "\n",
    "mat_train=mat[:30]\n",
    "mat_test=mat[30:110]\n",
    "\n",
    "lab_train=sg.BinaryLabels(np.array(labels[:30]).reshape((30,)))\n",
    "lab_test=sg.BinaryLabels(np.array(labels[30:110]).reshape((len(labels[30:110]),)))\n",
    "\n",
    "feats_train = sg.RealFeatures(np.array(mat_train).T)\n",
    "feats_test = sg.RealFeatures(np.array(mat_test).T)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#without normalization\n",
    "gaussian_kernel=sg.GaussianKernel()\n",
    "gaussian_kernel.init(feats_train, feats_train)\n",
    "gaussian_kernel.set_width(0.1)\n",
    "\n",
    "C=1\n",
    "svm=sg.LibSVM(C, gaussian_kernel, lab_train)\n",
    "_=svm.train()\n",
    "output=svm.apply(feats_test)\n",
    "\n",
    "Err=sg.ErrorRateMeasure()\n",
    "error=Err.evaluate(output, lab_test)\n",
    "print 'Error:', error\n",
    "\n",
    "#set normalization\n",
    "gaussian_kernel=sg.GaussianKernel()\n",
    "# TODO: currently there is a bug that makes it impossible to use Gaussian kernels and kernel normalisers\n",
    "# See github issue #3504\n",
    "#gaussian_kernel.set_normalizer(sg.SqrtDiagKernelNormalizer())\n",
    "gaussian_kernel.init(feats_train, feats_train)\n",
    "gaussian_kernel.set_width(0.1)\n",
    "\n",
    "svm.set_kernel(gaussian_kernel)\n",
    "svm.train()\n",
    "output=svm.apply(feats_test)\n",
    "\n",
    "Err=sg.ErrorRateMeasure()\n",
    "error=Err.evaluate(output, lab_test)\n",
    "print 'Error with normalization:', error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiclass classification "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Multiclass classification can be done using SVM by reducing the problem to binary classification. More on multiclass reductions in [this notebook](http://www.shogun-toolbox.org/static/notebook/current/multiclass_reduction.html). [CGMNPSVM](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGMNPSVM.html) class provides a built in [one vs rest multiclass](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CMulticlassOneVsRestStrategy.html) classification using [GMNPlib](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGMNPLib.html). Let us see classification using it on four classes. [CGMM](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGMM.html) class is used to sample the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "num=30;\n",
    "num_components=4\n",
    "means=np.zeros((num_components, 2))\n",
    "means[0]=[-1.5,1.5]\n",
    "means[1]=[1.5,-1.5]\n",
    "means[2]=[-1.5,-1.5]\n",
    "means[3]=[1.5,1.5]\n",
    "\n",
    "covs=np.array([[1.0,0.0],[0.0,1.0]])\n",
    "\n",
    "gmm=sg.GMM(num_components)\n",
    "[gmm.set_nth_mean(means[i], i) for i in range(num_components)]\n",
    "[gmm.set_nth_cov(covs,i) for i in range(num_components)]\n",
    "gmm.set_coef(np.array([1.0,0.0,0.0,0.0]))\n",
    "xntr=np.array([gmm.sample() for i in xrange(num)]).T\n",
    "xnte=np.array([gmm.sample() for i in xrange(5000)]).T\n",
    "gmm.set_coef(np.array([0.0,1.0,0.0,0.0]))\n",
    "xntr1=np.array([gmm.sample() for i in xrange(num)]).T\n",
    "xnte1=np.array([gmm.sample() for i in xrange(5000)]).T\n",
    "gmm.set_coef(np.array([0.0,0.0,1.0,0.0]))\n",
    "xptr=np.array([gmm.sample() for i in xrange(num)]).T\n",
    "xpte=np.array([gmm.sample() for i in xrange(5000)]).T\n",
    "gmm.set_coef(np.array([0.0,0.0,0.0,1.0]))\n",
    "xptr1=np.array([gmm.sample() for i in xrange(num)]).T\n",
    "xpte1=np.array([gmm.sample() for i in xrange(5000)]).T\n",
    "traindata=np.concatenate((xntr,xntr1,xptr,xptr1), axis=1)\n",
    "testdata=np.concatenate((xnte,xnte1,xpte,xpte1), axis=1)\n",
    "\n",
    "l0 = np.array([0.0 for i in xrange(num)])\n",
    "l1 = np.array([1.0 for i in xrange(num)])\n",
    "l2 = np.array([2.0 for i in xrange(num)])\n",
    "l3 = np.array([3.0 for i in xrange(num)])\n",
    "\n",
    "trainlab=np.concatenate((l0,l1,l2,l3))\n",
    "testlab=np.concatenate((l0,l1,l2,l3))\n",
    "\n",
    "plt.title('Toy data for multiclass classification')\n",
    "plt.jet()\n",
    "plt.scatter(traindata[0,:], traindata[1,:], c=trainlab, s=75)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "feats_train=sg.RealFeatures(traindata)\n",
    "labels=sg.MulticlassLabels(trainlab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us try the multiclass classification for different kernels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gaussian_kernel=sg.GaussianKernel(feats_train, feats_train, 2)\n",
    "poly_kernel=sg.PolyKernel(feats_train, feats_train, 4, True)\n",
    "linear_kernel=sg.LinearKernel(feats_train, feats_train)\n",
    "\n",
    "kernels=[gaussian_kernel, poly_kernel, linear_kernel]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "svm=sg.GMNPSVM(1, gaussian_kernel, labels)\n",
    "_=svm.train(feats_train)\n",
    "\n",
    "size=100\n",
    "x1=np.linspace(-6, 6, size)\n",
    "x2=np.linspace(-6, 6, size)\n",
    "x, y=np.meshgrid(x1, x2)\n",
    "grid=sg.RealFeatures(np.array((np.ravel(x), np.ravel(y))))\n",
    "def plot_outputs(kernels):\n",
    "    plt.figure(figsize=(20,5))\n",
    "    plt.suptitle('Multiclass Classification using different kernels', fontsize=12)\n",
    "    for i in range(len(kernels)):\n",
    "        plt.subplot(1,len(kernels),i+1)\n",
    "        plt.title(kernels[i].get_name())\n",
    "        svm.set_kernel(kernels[i])\n",
    "        svm.train(feats_train)\n",
    "        grid_out=svm.apply(grid)\n",
    "        z=grid_out.get_labels().reshape((size, size))\n",
    "        plt.pcolor(x, y, z)\n",
    "        plt.contour(x, y, z, linewidths=1, colors='black', hold=True)\n",
    "        plt.colorbar(c)\n",
    "        plt.scatter(traindata[0,:], traindata[1,:], c=trainlab, s=35)\n",
    "\n",
    "plot_outputs(kernels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The distinguishing properties of the kernels are visible in these classification outputs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[1] Classification in a Normalized Feature Space Using Support Vector Machines - Arnulf B. A. Graf, Alexander J. Smola, and Silvio Borer - IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 3, MAY 2003\n",
    "\n",
    "[2] Boyd, Stephen P.; Vandenberghe, Lieven (2004). Convex Optimization ([pdf](http://www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf)). Cambridge University Press. ISBN 978-0-521-83378-3. Retrieved October 15, 2011.\n",
    "\n",
    "[3] Lin, H., Lin, C., and Weng, R. (2007). A note on Platt's probabilistic outputs for support vector machines."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
